Clustering of solutions in the random satisfiability problem.

Using elementary rigorous methods we prove the existence of a clustered phase in the random
$K$-SAT problem, for $K \geq 8$. In this phase the solutions are grouped into clusters which are far
away from each other. The results are in agreement with previous predictions of the cavity method
and give a rigorous confirmation to one of its main building blocks. The approach can be
generalized to other systems of both physical and computational interest.

Constraint satisfaction problems (CSPs) provide one of the main building blocks for complex
systems studied in computer science, information theory and statistical physics, and may even
turn out to be important in the statistical studies of biological networks. Typically, they
involve a large number of discrete variables, each one taking a finite number of values, and a
set of constraints: each constraint involves a few variables, and forbids some of their joint
assignments. A simple example is the $q$-coloring of a graph, where one should assign to each
vertex of the graph a color in $\{1, \ldots, q\}$, in such a way that two vertices related by an
edge have different colors. In the case $q = 2$, this is nothing but the zero temperature limit
of an antiferromagnetic problem, which is known to display a spin-glass behaviour when the graph
is frustrated and disordered. CSPs also appear naturally in the studies of structural glasses [1]
and rigidity percolation [2].

Given an instance of a CSP, one wants to know whether there exists a solution, that is an
assignment of the variables which satisfies all the constraints (e.g. a proper coloring). When
it exists the instance is called SAT, and one wants to find a solution. Most of the interesting
CSPs are NP-complete: in the worst case the number of operations needed to decide whether an
instance is SAT or not is expected to grow exponentially with the number of variables. But
recent years have seen an upsurge of interest in the theory of typical-case complexity, where
one tries to identify random ensembles of CSPs which are hard to solve, and the reason for this
difficulty. Random ensembles of CSPs are also of great theoretical and practical importance in
communication theory: some of the best error correcting codes (the so-called low density parity
check codes) are based on such constructions [3, 4].

The archetypical example of CSP is Satisfiability (SAT). This is a core problem in computational
complexity: it is the first one to have been shown NP-complete [5], and since then thousands of
problems have been shown to be computationally equivalent to it. Yet it is not so easy to find
difficult instances. The main ensemble which has been used for this goal is the random
$K$-satisfiability ($K$-SAT) ensemble. The variables are $N$ binary variables (Ising spins)
$\sigma = \{\sigma_i\} \in \{-1, 1\}^N$. The constraints are called $K$-clauses. Each of them
involves $K$ distinct spin variables, randomly chosen with uniform distribution, and it forbids
one configuration of these spins, randomly chosen among the $2^K$ possible ones. A set of $M$
clauses defines the problem. This corresponds to generating a random logical formula in
conjunctive normal form, which is a very generic problem appearing in logic. $K$-SAT can also be
written as the problem of minimizing a spin-glass-like energy function which counts the number
of violated clauses, and in this respect random $K$-SAT is seen as a prototypical diluted
spin-glass [6]. Here we shall keep to the most interesting case $K \geq 3$ (for $K = 2$
the problem is polynomial).
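
As an illustration of the ensemble just described, here is a minimal Python sketch that draws a random $K$-SAT formula and evaluates the energy (number of violated clauses) of a spin configuration. The representation of a clause as a pair (variables, forbidden configuration) and the names `random_ksat` and `energy` are assumptions made for this example; they do not come from the paper.

```python
import random


def random_ksat(n_vars, n_clauses, k, rng):
    """Draw a random K-SAT formula as a list of clauses.

    Each clause is a pair (variables, forbidden): K distinct variable indices
    chosen uniformly, and the single spin configuration on those variables
    that the clause forbids, chosen uniformly among the 2^K possibilities.
    """
    formula = []
    for _ in range(n_clauses):
        variables = rng.sample(range(n_vars), k)
        forbidden = [rng.choice((-1, 1)) for _ in range(k)]
        formula.append((variables, forbidden))
    return formula


def energy(formula, sigma):
    """Number of clauses violated by the spin configuration sigma."""
    return sum(
        all(sigma[i] == s for i, s in zip(variables, forbidden))
        for variables, forbidden in formula
    )


rng = random.Random(0)
formula = random_ksat(n_vars=20, n_clauses=40, k=3, rng=rng)  # alpha = M/N = 2
sigma = [rng.choice((-1, 1)) for _ in range(20)]
print("violated clauses:", energy(formula, sigma))
```

A configuration is a solution exactly when its energy is zero, which is the sense in which $K$-SAT is an energy minimization problem.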

In recent years random $K$-SAT has attracted much interest in computer science and in
statistical physics [7, 8, 9, 10]. The interesting limit is the thermodynamic limit
$N \to \infty$, $M \to \infty$ at fixed clause density $\alpha = M/N$. Its most striking feature
is certainly its sharp threshold. It is strongly believed that there exists a phase transition
for this problem: numerical and heuristic analytical arguments are in support of the so-called
Satisfiability Threshold Conjecture:

There exists $\alpha_c(K)$ such that with high probability:

- if $\alpha < \alpha_c(K)$, a random instance is satisfiable;

- if $\alpha > \alpha_c(K)$, a random instance is unsatisfiable.

In all this paper, 'with high probability' (w.h.p.) means with a probability going to one in the
$N \to \infty$ limit. Although this conjecture remains unproven, Friedgut has come close to it by
establishing the existence of a non-uniform sharp threshold [11]. A lot of effort has been
devoted to understanding this phase transition. This is interesting both from the physics point
of view and from the computer science one, because the random instances with $\alpha$ close to
$\alpha_c$ are the hardest to solve. The most important rigorous results so far are bounds for
the threshold $\alpha_c(K)$. The best upper bounds were derived using first moment methods
[12, 13]. Lower bounds can be found by analyzing some algorithms which find SAT assignments
[14, 15], but recently a new method, based on second moment methods, has found better and
algorithm-independent lower bounds [16, 17]. Using these bounds, it was shown that $\alpha_c(K)$
scales as $2^K \ln(2)$ when $K \to \infty$.
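
To make the $2^K \ln(2)$ scaling concrete, the sketch below evaluates the simplest first-moment (annealed) upper bound, obtained by requiring that the expected number of solutions $2^N (1 - 2^{-K})^{\alpha N}$ does not vanish. This is the textbook bound, not the refined bounds of [12, 13], and the function name `first_moment_bound` is an assumption of this example.

```python
import math


def first_moment_bound(k):
    """Clause density above which the expected number of solutions vanishes.

    E[#solutions] = 2^N (1 - 2^-K)^(alpha N), so the bound solves
    ln 2 + alpha * ln(1 - 2^-K) = 0.
    """
    return -math.log(2.0) / math.log1p(-(2.0 ** -k))


for k in (3, 8, 20):
    bound = first_moment_bound(k)
    # The second printed column is the ratio to the asymptotic form 2^K ln 2.
    print(k, bound, bound / (2.0 ** k * math.log(2.0)))
```

The ratio printed in the last column approaches one as $K$ grows, which is the upper-bound half of the scaling statement above.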

On the other hand, the cavity method, which is a powerful tool from the statistical physics of
disordered systems [18], is claimed to be able to compute the exact value of the threshold
[19, 20, 21], giving for instance $\alpha_c(3) \simeq 4.2667$... It is a non-rigorous method but
the self-consistency of its results has been checked by a 'stability analysis' [22, 23, 21], and
it has also led to the development of a new algorithmic strategy, 'survey propagation', which can
solve very large instances at clause densities which are very close to the threshold (e.g.
$N = 10^6$ and $\alpha = 4.25$).

The main hypothesis on which the cavity analysis of random $K$-satisfiability relies is the
existence, in a region of clause density $[\alpha_d, \alpha_c]$ close to the threshold, of an
intermediate phase called the 'hard-SAT' phase. In this phase the set $S$ of solutions (a subset
of the vertices of the $N$-dimensional hypercube) is supposed to split into many disconnected
clusters $S = S_1 \cup S_2 \cup \ldots$. If one considers two solutions $X, Y$ in the same
cluster $S_j$, it is possible to walk from $X$ to $Y$ (staying in $S$) by flipping at each step a
finite number of spins. If on the other hand $X$ and $Y$ are in different clusters, in order to
walk from $X$ to $Y$ (staying in $S$), at least one step will involve an extensive number (i.e.
$\propto N$) of spin flips. This clustered phase is held responsible for entrapping many local
search algorithms into non-optimal metastable states [24]. This phenomenon is not exclusive to
random $K$-SAT. It is also predicted to appear in many other hard satisfiability and optimization
problems such as Coloring [25, 26] or the Multi-Index Matching Problem [27], and corresponds to
a 'one step replica symmetry breaking' (1RSB) phase in the language of statistical physics. It
is also a crucial limiting feature for decoding algorithms in some error correcting codes [28].
So far, the only CSP for which the existence of the clustering phase has been established
rigorously is the simple polynomial problem of random XOR-SAT [29, 30]. In other cases it is a
hypothesis, the self-consistency of which is checked by the cavity method.

In this paper we provide rigorous arguments which show the existence of the clustering
phenomenon in random $K$-SAT, for large enough $K$, in some region of $\alpha$ included in the
interval $[\alpha_d(K), \alpha_c(K)]$ predicted by the statistical physics analysis. Our result
is not able to confirm all the details of this analysis but it provides strong evidence in
favour of its validity.

Given an instance $F$ of random $K$-satisfiability, we define a SAT-$x$-pair as a pair of
assignments $(\sigma, \tau) \in \{-1, 1\}^{2N}$, which both satisfy $F$, and which are at a
Hamming distance $d_{\sigma\tau} = \frac{1}{2} \sum_{i=1}^{N} (1 - \sigma_i \tau_i)$ specified by
$x$ as follows:

$$ d_{\sigma\tau} \in [Nx - \epsilon(N), Nx + \epsilon(N)] \tag{1} $$

Here $x$ is the normalized distance between the two configurations, which we keep fixed as $N$
and $d_{\sigma\tau}$ go to infinity. The resolution $\epsilon(N)$ must be such that
$\lim_{N \to \infty} \epsilon(N)/N = 0$, but its precise form is unimportant for our large $N$
analysis. One can choose for instance $\epsilon(N) = \sqrt{N}$.

We call $x$-satisfiable a formula for which such a pair of solutions exists. Our study mimics
the usual steps which are taken in rigorous studies of $K$-SAT, but taking pairs of assignments
at a fixed distance instead of single assignments.
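
The definition of a SAT-$x$-pair can be spelled out in a few lines of Python. This is only a sketch under stated assumptions: a clause is stored as a pair (variable indices, forbidden spin configuration), the helper names `satisfies`, `hamming` and `is_sat_x_pair` are hypothetical, and the default resolution is the choice $\epsilon(N) = \sqrt{N}$ mentioned above.

```python
import math


def satisfies(formula, sigma):
    """True if sigma violates no clause (a clause forbids one configuration)."""
    return not any(
        all(sigma[i] == s for i, s in zip(variables, forbidden))
        for variables, forbidden in formula
    )


def hamming(sigma, tau):
    """Hamming distance d = (1/2) * sum_i (1 - sigma_i * tau_i)."""
    return sum(1 - s * t for s, t in zip(sigma, tau)) // 2


def is_sat_x_pair(formula, sigma, tau, x, eps=None):
    """Check condition (1) together with satisfiability of both assignments."""
    n = len(sigma)
    if eps is None:
        eps = math.sqrt(n)  # one admissible choice of resolution
    d = hamming(sigma, tau)
    in_window = n * x - eps <= d <= n * x + eps
    return in_window and satisfies(formula, sigma) and satisfies(formula, tau)


# Tiny hand-made instance on N = 4 spins with K = 3 (illustrative only).
formula = [((0, 1, 2), (-1, -1, -1)), ((1, 2, 3), (1, 1, 1))]
sigma = [1, 1, -1, -1]
tau = [1, -1, 1, -1]
print(hamming(sigma, tau), is_sat_x_pair(formula, sigma, tau, x=0.5, eps=0.5))
```

For such a small $N$ the resolution window is passed explicitly, since $\sqrt{N}$ is not small compared to $N$ here.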

We first formulate the $x$-Satisfiability Threshold Conjecture:

For all $K \geq 2$ and for all $x$, $0 < x < 1$, there exists an $\alpha_c(K, x)$ such that
w.h.p.:

- if $\alpha < \alpha_c(K, x)$, a random $K$-CNF is $x$-satisfiable;

- if $\alpha > \alpha_c(K, x)$, a random $K$-CNF is $x$-unsatisfiable,

which generalizes the usual satisfiability threshold conjecture (obtained for $x = 0$). We shall
find explicitly below two functions, $\alpha_{LB}(K, x)$ and $\alpha_{UB}(K, x)$, which give
lower and upper bounds for $\alpha$ for $x$-satisfiability at a given value of $K$. Numerical
computations of these bounds show that $\alpha_c(K, x)$ is non-monotonic as a function of $x$
for $K \geq 8$, as illustrated in Fig. 1. This in turn shows that, for $K$ large enough and in
some well chosen interval of $\alpha$ below the satisfiability threshold, SAT-$x$-pairs exist
for $x$ close to 0 ($\sigma$ and $\tau$ in the same cluster) and $x$ close to 0.5 ($\sigma$ and
$\tau$ in different clusters), but there is an intermediate $x$ region where they do not exist.
Fig. 1 shows an explicit example of this scenario for a particular value of $\alpha$.

In what follows we first establish a rigorous and explicit upper bound using a simple first
moment method. Subsequently, we provide a (numerical) lower bound using a second moment method
[16, 17]. Both results are based on elementary probabilistic techniques which could be
generalized to other physical systems or random combinatorial problems.

Upper bound: the first moment method. We use the fact that, when $Z$ is a non-negative random
variable:

$$ P(Z \geq 1) \leq E(Z). \tag{2} $$

Given a formula $F$, we take $Z(F)$ to be the number of pairs of solutions at fixed distance
(with resolution $\epsilon(N)$):

$$ Z(F) = \sum_{\sigma, \tau} \delta\left( d_{\sigma\tau} \in [Nx - \epsilon(N), Nx + \epsilon(N)] \right) \, \delta\left( \sigma, \tau \in S(F) \right), \tag{3} $$

where $S(F)$ is the set of solutions to $F$. Throughout this paper $\delta(A)$ is an indicator
function, equal to 1 if the statement $A$ is true, and to 0 otherwise. Since $Z(F) \geq 1$ is
equivalent to "$F$ is $x$-satisfiable", (2) gives an upper bound for the probability of
$x$-satisfiability. The expected value of the double sum over the choice of a random $F$ is:

$$ E(Z(F)) = \sum_{\sigma, \tau} \delta\left( d_{\sigma\tau} \in [Nx - \epsilon(N), Nx + \epsilon(N)] \right) \left( E\left[ \delta(\sigma, \tau \in S(c)) \right] \right)^M. \tag{4} $$

We have used $\delta(\sigma, \tau \in S(F)) = \prod_c \delta(\sigma, \tau \in S(c))$, where $c$
denotes the clauses, and the fact that clauses are drawn independently. The expectation
$E[\delta(\sigma, \tau \in S(c))]$ is equal to $1 - 2^{1-K} + 2^{-K} (1 - x)^K$: among the $2^K$
realizations of the clause, exactly two are violated by $\sigma$ or by $\tau$, unless the two
configurations coincide on the $K$ variables of $c$ (which happens with probability $(1 - x)^K$),
in which case there is only one.

In the thermodynamic limit, $\ln E(Z(F))/N \to \Phi_1(x, \alpha)$, where:

$$ \Phi_1(x, \alpha) = \ln 2 + H_2(x) + \alpha \ln\left[ 1 - 2^{-K} \left( 2 - (1 - x)^K \right) \right], $$

where $H_2(x) = -x \ln x - (1 - x) \ln(1 - x)$ is the two-state entropy function. This gives the
upper bound:

$$ \alpha_{UB}(K, x) = - \frac{\ln 2 + H_2(x)}{\ln\left( 1 - 2^{1-K} + 2^{-K} (1 - x)^K \right)}. \tag{5} $$

FIG. 1: Lower and upper bounds for the $x$-satisfiability threshold $\alpha_c(K = 8, x)$. The
upper curve is obtained by the first moment method. Above this curve there exists no SAT-$x$-pair,
w.h.p. The lower curve is obtained by the second moment method. Below this curve there exists a
SAT-$x$-pair w.h.p. For values of $\alpha$ lying between 164.735 and 170.657, these bounds
guarantee the existence of a clustering phenomenon. The horizontal line gives an example of this
phenomenon for $\alpha = 166.1$. We exhibit the successive phases as one varies $x$: $x$-SAT
regions are represented by a thick solid line, $x$-UNSAT regions by a wavy line, and "don't know"
regions by a dotted line. The $x$-SAT region near $x = 0$ corresponds to intra-cluster pairs,
whereas the $x$-SAT region around $x = 0.5$ corresponds to inter-cluster pairs. In this example,
the intermediate $x$-UNSAT region around $x \simeq 0.13$ shows the existence of a "gap" between
clusters. We recall that the best refined lower and upper bounds for the satisfiability
threshold $\alpha_c(K = 8)$ from [13, 17] are respectively 173.253 and 176.596. The cavity
prediction is 176.543 [21].

Lower bound: the second moment method. We use the fact that, when $Z$ is a non-negative random
variable:

$$ P(Z > 0) \geq \frac{E(Z)^2}{E(Z^2)}. \tag{6} $$

However using this formula with $Z$ equal to the number of solutions fails, and one must instead
use a weighted sum [16]. We follow the strategy recently developed in [17], which we generalize
to SAT-$x$-pairs by taking:

$$ Z(F) = \sum_{\sigma, \tau} \delta\left( d_{\sigma\tau} \in [Nx - \epsilon(N), Nx + \epsilon(N)] \right) \prod_{c} W(\sigma, \tau, c). \tag{7} $$

$W(\sigma, \tau, c)$ is a weight associated with the clause $c$, given the couple
$(\sigma, \tau)$, and is defined as follows: suppose that $c$ is satisfied by $n_\sigma$ among
the $K$ $\sigma$-variables involved in $c$, and by $n_\tau$ among the $K$ $\tau$-variables. Call
$n_0$ the number of common values between the $\sigma$- and $\tau$-variables involved in $c$.
Then define:

$$ W(\sigma, \tau, c) = \begin{cases} \ldots & \text{if } \ldots \\ 0 & \text{otherwise.} \end{cases} \tag{8} $$

Note that with this definition of $Z$, the choice $\lambda = 1$, $\nu = 1$ simply yields the
number of solutions (3).

Let us now compute the first two moments of $Z$ [31]:


$$ E(Z) = 2^N \binom{N}{Nx} \left[ f_1^{(\lambda, \nu)}(x) \right]^M \tag{9} $$

where $f_1^{(\lambda, \nu)}(x) = E(W(\sigma, \tau, c))$ can be calculated by simple combinatorics
(via multinomial sums). To compute $E(Z^2)$, we sum over four spin configurations
$\sigma, \tau, \sigma', \tau'$. Symmetry allows us to fix $\sigma_i = 1$. Let $N a(t, s', t')$ be
the number of sites $i$ such that $\tau_i = t$, $\sigma'_i = s'$ and $\tau'_i = t'$ (where
$t, s', t' \in \{\pm 1\}$). It turns out that the term of the sum depends only on these 8 numbers
$a(\pm 1, \pm 1, \pm 1)$. We collect them into a vector $\mathbf{a}$ and get:

$$ E(Z^2) = 2^N \sum_{\mathbf{a} \in V} \frac{N!}{\prod_{t, s', t'} \left( N a(t, s', t') \right)!} \left[ f_2^{(\lambda, \nu)}(\mathbf{a}) \right]^M \tag{10} $$

where $f_2^{(\lambda, \nu)}(\mathbf{a}) = E(W(\sigma, \tau, c) W(\sigma', \tau', c))$ can be
calculated by simple combinatorics in the same way as $f_1$. The integration set $V$ is a
5-dimensional simplex taking into account the normalization $\sum_{t, s', t'} a(t, s', t') = 1$
and the two constraints: $d_{\sigma\tau}/N \simeq x$, $d_{\sigma'\tau'}/N \simeq x$.

A saddle point evaluation of eqs. (9), (10) gives, for $N \to \infty$:

$$ \frac{E(Z)^2}{E(Z^2)} \geq C_0 \exp\left( -N \max_{\mathbf{a} \in V} \Phi_2(\mathbf{a}) \right), \tag{11} $$

where $C_0$ is a constant depending on $K$ and $x$, and:

$$ \Phi_2(\mathbf{a}) = H_8(\mathbf{a}) - \ln 2 - 2 H_2(x) + \alpha \ln f_2^{(\lambda, \nu)}(\mathbf{a}) - 2 \alpha \ln f_1^{(\lambda, \nu)}(x), \tag{12} $$

with $H_8(\mathbf{a}) = -\sum_{t, s', t'} a(t, s', t') \ln a(t, s', t')$. In general
$\max_{\mathbf{a} \in V} \Phi_2(\mathbf{a})$ is non-negative and one must choose appropriate
weights $W(\sigma, \tau, c)$ in such a way that $\max_{\mathbf{a} \in V} \Phi_2(\mathbf{a}) = 0$.
We notice that at the particular point $\mathbf{a}^*$ where $(\sigma, \tau)$ is uncorrelated with
$(\sigma', \tau')$, we have $\Phi_2(\mathbf{a}^*) = 0$. We fix the parameters $\lambda$ and $\nu$
defining the weights (8) in such a way that $\mathbf{a}^*$ be a local maximum of $\Phi_2$. This
gives two algebraic equations in $\lambda$ and $\nu$ which have a unique solution $\lambda > 0$,
$\nu > 0$. Fixing $\lambda$ and $\nu$ to these values, $\alpha_{LB}$ is the largest value of
$\alpha$ such that the local maximum at $\mathbf{a}^*$ is a global maximum, i.e. such that there
exists no $\mathbf{a} \in V$ with $\Phi_2(\mathbf{a}) > 0$:

$$ \alpha_{LB}(K, x) = \inf_{\mathbf{a} \in V} \frac{\ln 2 + 2 H_2(x) - H_8(\mathbf{a})}{\ln f_2^{(\lambda, \nu)}(\mathbf{a}) - 2 \ln f_1^{(\lambda, \nu)}(x)}. \tag{13} $$
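
Unlike the lower bound (13), the first-moment bound (5) is fully explicit and can be evaluated directly. The following Python sketch does so for $K = 8$ on a grid of $x$ values; the function names `h2` and `alpha_ub` and the grid spacing are choices made for this illustration only.

```python
import math


def h2(x):
    """Two-state entropy H_2(x) in nats, with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in (x, 1.0 - x) if p > 0.0)


def alpha_ub(k, x):
    """First-moment upper bound of Eq. (5)."""
    # Probability that a random clause is satisfied by both configurations.
    pair_sat = 1.0 - 2.0 ** (1 - k) + 2.0 ** (-k) * (1.0 - x) ** k
    return -(math.log(2.0) + h2(x)) / math.log(pair_sat)


xs = [i / 1000 for i in range(501)]  # grid on 0 <= x <= 0.5
values = [alpha_ub(8, x) for x in xs]
x_min = xs[values.index(min(values))]
print("alpha_UB(8, 0)   =", alpha_ub(8, 0.0))
print("alpha_UB(8, 0.5) =", alpha_ub(8, 0.5))
print("minimum over x   =", min(values), "at x =", x_min)
```

The bound dips well below its values at $x = 0$ and $x = 0.5$ at an intermediate distance, and its minimum should come out close to the value 164.735 quoted in the caption of Fig. 1 as the lower edge of the clustering window.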



We devised several numerical strategies to evaluate $\alpha_{LB}(K, x)$. The implementation of
Powell's method starting from each point of a grid of size $\mathcal{N}^5$
($\mathcal{N} = 10, 15, 20$) on $V$ turned out to be the most efficient and reliable. The results
are given in Fig. 1 for $K = 8$, the smallest $K$ such that the clustering conjecture is
confirmed. We found a clustering phenomenon for all the values of $K \geq 8$ that we checked, and
in fact the relative difference
$[\alpha_{UB}(K, x) - \alpha_{LB}(K, x)] / \alpha_{LB}(K, x)$ seems to go to zero at large $K$.
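
The multi-start strategy can be sketched as follows. Two substitutions are made here and should be read as assumptions: a plain compass (coordinate) search stands in for Powell's method so that the example needs only the standard library, and a two-dimensional toy function with several local minima stands in for the actual objective, which is the ratio in (13) on the 5-dimensional set $V$. All names are hypothetical.

```python
import itertools


def compass_search(f, start, step=0.5, tol=1e-6):
    """Derivative-free local minimisation by axis-aligned trial moves.

    A simple stand-in for Powell's method: accept any improving move of
    size `step` along a coordinate axis, halve the step when none improves.
    """
    point, value = list(start), f(start)
    while step > tol:
        improved = False
        for i in range(len(point)):
            for delta in (step, -step):
                trial = list(point)
                trial[i] += delta
                trial_value = f(trial)
                if trial_value < value:
                    point, value, improved = trial, trial_value, True
        if not improved:
            step /= 2
    return point, value


def multistart(f, grid_axes):
    """Run the local search from every grid point and keep the best result."""
    runs = (compass_search(f, start) for start in itertools.product(*grid_axes))
    return min(runs, key=lambda run: run[1])


def toy(p):
    """Toy objective with four local minima; the global one is near (-1, -1)."""
    u, v = p
    return (u * u - 1) ** 2 + (v * v - 1) ** 2 + 0.3 * u + 0.2 * v


axis = [-2.0 + i for i in range(5)]  # grid of size 5^2 on [-2, 2]^2
best_point, best_value = multistart(toy, [axis, axis])
print(best_point, best_value)
```

A single run started at (2, 2) stays in the local minimum near (1, 1), which is why restarting from a whole grid of initial points matters when the infimum must be a global one.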

We have presented a simple probabilistic argument which rigorously establishes the existence of
a clustered 'hard-SAT' phase. The prediction from the cavity method is in fact a weaker
statement. It can be stated in terms of the overlap distribution function $P(x)$, which is the
probability, when two SAT-assignments are taken randomly (with uniform distribution), that their
distance is given by $x$. The cavity method finds that this distribution has a support
concentrated on two values: a large value $x_1$, close to one, gives the characteristic 'radius'
of a cluster, a smaller value $x_0$ gives the characteristic distance between clusters. This
does not imply that there exists no pair of solutions for values of $x$ distinct from
$x_0, x_1$: it just means that such pairs are exponentially less numerous than the typical ones.
Our rigorous result shows that in fact there exists a true gap in $x$, with no SAT-$x$-pairs, at
least for $K \geq 8$. More sophisticated moment computations might yield results for smaller
values of $K$. Still, the conceptual simplicity of our computation makes it a useful tool for
proving similar phenomena in other systems of physical or computational interest, like for
instance the graph-coloring (antiferromagnetic Potts) problem.
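
The distribution $P(x)$ discussed above can be computed exhaustively on a toy instance. The sketch below enumerates all solutions of a small random 3-SAT formula and histograms the distances between ordered pairs of solutions. It only illustrates the bookkeeping: the sizes used here ($N = 12$, $K = 3$) are far from the large-$N$, $K \geq 8$ regime of the result, so no gap should be expected, and all names are assumptions of this example.

```python
import itertools
import random
from collections import Counter


def random_ksat(n_vars, n_clauses, k, rng):
    """Clauses as (variable indices, forbidden spin configuration)."""
    return [
        (rng.sample(range(n_vars), k), [rng.choice((-1, 1)) for _ in range(k)])
        for _ in range(n_clauses)
    ]


def all_solutions(formula, n_vars):
    """Exhaustive enumeration of satisfying assignments (small n_vars only)."""
    return [
        sigma
        for sigma in itertools.product((-1, 1), repeat=n_vars)
        if not any(all(sigma[i] == s for i, s in zip(vs, fb)) for vs, fb in formula)
    ]


def distance_counts(solutions):
    """Number of ordered pairs of solutions at each Hamming distance."""
    counts = Counter()
    for sigma in solutions:
        for tau in solutions:
            counts[sum(s != t for s, t in zip(sigma, tau))] += 1
    return counts


n_vars = 12
rng = random.Random(1)
formula = random_ksat(n_vars, n_clauses=20, k=3, rng=rng)
solutions = all_solutions(formula, n_vars)
counts = distance_counts(solutions)
total = sum(counts.values())
for d in sorted(counts):
    print(d / n_vars, counts[d] / total)  # normalized distance x, empirical P(x)
```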

This work has been supported in part by the EC 
through the network MTR 2002-00319 'STIPCO' and the 



[1] M. Sellitto, G. Biroli and C. Toninelli, Europhys. Lett. 69, 496 (2005).
[2] J. Barre et al., cond-mat/0408385.
[3] Robert G. Gallager. Information Theory and Reliable Communication. Wiley, New York, 1968.
[4] David J.C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge
University Press, Cambridge, 2002.
[5] Stephen Cook. The complexity of theorem proving procedures. In Proceedings of the Third
Annual ACM Symposium on Theory of Computing, pages 151-158, 1971.
[6] R. Monasson, R. Zecchina, Phys. Rev. E 56, 1357 (1997).
[7] T. Hogg, B.A. Huberman, C. Williams (eds.), Artificial Intelligence 81 I & II (1996).
[8] Special Issue on NP-hardness and Phase Transitions, edited by O. Dubois, R. Monasson,
B. Selman and R. Zecchina, Theor. Comp. Sci. 265, Issue 1-2 (2001).
[9] S. Kirkpatrick, B. Selman, Science 264, 1297 (1994).
[10] R. Monasson, R. Zecchina, S. Kirkpatrick, B. Selman, and L. Troyanski, Nature 400, 133
(1999).
[11] E. Friedgut, Journal of the AMS 12, 1017 (1999).
[12] L. M. Kirousis, E. Kranakis, D. Krizanc. Technical report TR-96-09, School of Computer
Science, Carleton University, 1996.
[13] O. Dubois, Y. Boufkhad, J. Algorithms 24, 395 (1997).
[14] M.-T. Chao, J. Franco, Inform. Sci. 51(3), 289 (1990).
[15] A. M. Frieze, S. Suen, J. Algorithms 20, 312 (1996).
[16] D. Achlioptas, C. Moore, Proc. Foundations of Computer Science (2002).
[17] D. Achlioptas, Y. Peres, Journal of the AMS 17, 947 (2004).
[18] M. Mezard, G. Parisi, J. Stat. Phys. 111 (2003).
[19] M. Mezard, R. Zecchina, Phys. Rev. E 66, 056126 (2002).
[20] M. Mezard, G. Parisi, R. Zecchina, Science 297, 812 (2002).
[21] S. Mertens, M. Mezard, R. Zecchina, Threshold values of Random K-SAT from the cavity
method, cs.CC/0309020 (2003), to appear in Random Structures and Algorithms.
[22] A. Montanari, F. Ricci-Tersenghi, Eur. Phys. J. B 33, 339 (2003).
[23] A. Montanari, G. Parisi, F. Ricci-Tersenghi, J. Phys. A 37, 2073 (2004).
[24] G. Semerjian, R. Monasson, Proceedings of the SAT 2003 conference, E. Giunchiglia and
A. Tacchella eds., Lecture Notes in Computer Science (Springer) 2919, 120 (2004).
[25] R. Mulet, A. Pagnani, M. Weigt, R. Zecchina, Phys. Rev. Lett. 89, 268701 (2002).
[26] A. Braunstein, R. Mulet, A. Pagnani, M. Weigt, R. Zecchina, Phys. Rev. E 68, 036702
(2003).
[27] O. C. Martin, M. Mezard, O. Rivoire, Phys. Rev. Lett. 93, 217205 (2004).
[28] A. Montanari, Eur. Phys. J. B 23, 121 (2001).
[29] M. Mezard, F. Ricci-Tersenghi, R. Zecchina, J. Stat. Phys. 111, 505 (2003).
[30] S. Cocco, O. Dubois, J. Mandler, R. Monasson, Phys. Rev. Lett. 90, 047205 (2003).
[31] M. Mezard, T. Mora, R. Zecchina, in preparation.



